Filter pandas DataFrame by substring criteria

2024-03-20 21:25| 来源: 网络整理| 查看: 265

Introduction

Pandas is a popular Python library for data analysis and manipulation. The DataFrame is one of the key data structures in Pandas, providing a way to store and work with structured data in a tabular format. DataFrames are useful for organizing and storing data in a consistent format, allowing you to perform operations on the data such as filtering, grouping, and aggregation.

DataFrames can be created from various sources, such as a list of lists, a dictionary, or an existing DataFrame. Here is an example of how to create a DataFrame from a list of dictionaries:

Python3

import pandas as pd # List of dictionaries containing data data = [ {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"}, {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"}, {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"}, {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"}, {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"}, {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"}, {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd,West New York, NJ 07093"}, {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"}, {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"}, ] # Create the DataFrame df = pd.DataFrame(data) print(df)

Output:

Name Age Address 0 John Smith 35 123 Main St, New York, NY 10001 1 Jane Doe 28 456 Park Ave, Newark, NJ 70004 2 Joe Schmo 55 789 Broad Way, Jersey City, NJ 07306 3 Sally Smith 42 321 Maple St, Hoboken, NJ 07030 4 Bob Johnson 28 654 Cedar Blvd, Union City, NJ 07087 5 Sue Johnson 29 912 Oak St, Weehawken, NJ 07086 6 Bill Williams 33 245 Pine Rd, West New York, NJ 07093 7 Mary Johnson 25 369 Birch Ave, Guttenberg, NJ 07093 8 Tom Williams 44 159 Willow St, Hoboken, NJ 07030 Filter the Dataframe

Now suppose you need to find the list of users with the age of 28. You can do the same by writing the following code:

Python3

df = df.loc[df["Age"] == 28] print(df)

Output:

This would filter the DataFrame to only include rows where the “Age” is 28 and produce the following output:

Name Age Address 1 Jane Doe 28 456 Park Ave, Newark, NJ 70004 4 Bob Johnson 28 654 Cedar Blvd, Union City, NJ 07087

This seems pretty easy. So what if we need to find the users with addresses in New York? Since the address column contains information like street, city, zip code, etc. we need to use the substring filter operations for the same

Method 1: Using loc with str.contains

To filter the DataFrame using a substring in the “Address” column, you can use the .loc[] method and specify the desired substring in the filter criteria using the .str.contains() method:

Python3

import pandas as pd # List of dictionaries containing data data = [ {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"}, {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"}, {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"}, {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"}, {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"}, {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"}, {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd, West New York, NJ 07093"}, {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"}, {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"}, ] # Create the DataFrame df = pd.DataFrame(data) # Filter the DataFrame to only include rows where the Address contains the substring "New York" df = df.loc[df["Address"].str.contains("New York")] print(df)

Output:

This would filter the DataFrame to only include rows where the “Address” column contains the substring “New York” and produce the output as follows:

Name Age Address 0 John Smith 35 123 Main St, New York, NY 10001 6 Bill Williams 33 245 Pine Rd, West New York, NJ 07093 Method 2: Using `query` method

Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .query() method and specify the desired substring in the filter criteria:

Python3

import pandas as pd # List of dictionaries containing data data = [ {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"}, {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"}, {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"}, {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"}, {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"}, {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"}, {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd, West New York, NJ 07093"}, {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"}, {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"}, ] # Create the DataFrame df = pd.DataFrame(data) # Filter the DataFrame to only include rows where the Address contains the substring "New York" df = df.query("Address.str.contains('New York', case=False)") print(df)

Output:

Name Age Address 0 John Smith 35 123 Main St, New York, NY 10001 6 Bill Williams 33 245 Pine Rd, West New York, NJ 07093 Method 3: Using `apply` method

Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .apply() method and specify a custom function that returns True if the “Address” column contains the desired substring, and False otherwise:

Python3

import pandas as pd # List of dictionaries containing data data = [ {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"}, {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"}, {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"}, {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"}, {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"}, {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"}, {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd, West New York, NJ 07093"}, {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"}, {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"}, ] # Create the DataFrame df = pd.DataFrame(data) # Define a custom function to check if the Address contains the substring "New York" def contains_new_york(address): return "New York" in address # Filter the DataFrame to only include rows where the Address contains the substring "New York" df = df[df["Address"].apply(contains_new_york)] print(df)

Output:

Name Age Address 0 John Smith 35 123 Main St, New York, NY 10001 6 Bill Williams 33 245 Pine Rd, West New York, NJ 07093 Method 4: Using map method

Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .map() method and specify a custom function that returns True if the “Address” column contains the desired substring, and False otherwise:

Python3

import pandas as pd # List of dictionaries containing data data = [ {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"}, {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"}, {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"}, {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"}, {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"}, {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"}, {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd, West New York, NJ 07093"}, {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"}, {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"}, ] # Create the DataFrame df = pd.DataFrame(data) # Define a custom function to check if the Address contains the substring "New York" def contains_new_york(address): return "New York" in address # Map the custom function to the Address column to create a new column containing the result of the function df = df[df["Address"].map(contains_new_york)] print(df)

Output:

Name Age Address 0 John Smith 35 123 Main St, New York, NY 10001 6 Bill Williams 33 245 Pine Rd, West New York, NJ 07093 Method 5: Using `filter` method

Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .filter() method and specify the desired substring in the filter criteria using the like operator:

Python3

import pandas as pd # List of dictionaries containing data data = [ {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"}, {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"}, {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"}, {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"}, {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"}, {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"}, {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd, West New York, NJ 07093"}, {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"}, {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"}, ] # Create the DataFrame df = pd.DataFrame(data) # Filter the DataFrame to only include the mentioned columns df = df.filter(like="Address") print(df)

Output:

Address 123 Main St, New York, NY 10001 456 Park Ave, Newark, NJ 70004 789 Broad Way, Jersey City, NJ 07306 321 Maple St, Hoboken, NJ 07030 654 Cedar Blvd, Union City, NJ 07087 912 Oak St, Weehawken, NJ 07086 245 Pine Rd, West New York, NJ 07093 369 Birch Ave, Guttenberg, NJ 07093 159 Willow St, Hoboken, NJ 07030

The .filter() method is similar to the .query() method in that it allows you to specify filter criteria. However, the .filter() method only filters columns, whereas the .query() method can filter both columns and rows.

Method 6: Using `isin` method

Another method you could use to filter the DataFrame using a substring in the “Address” column is to use the .isin() method and specify the desired substring in the filter criteria:

Python3

import pandas as pd # List of dictionaries containing data data = [ {"Name": "John Smith", "Age": 35, "Address": "123 Main St, New York, NY 10001"}, {"Name": "Jane Doe", "Age": 28, "Address": "456 Park Ave, Newark, NJ 70004"}, {"Name": "Joe Schmo", "Age": 55, "Address": "789 Broad Way, Jersey City, NJ 07306"}, {"Name": "Sally Smith", "Age": 42, "Address": "321 Maple St, Hoboken, NJ 07030"}, {"Name": "Bob Johnson", "Age": 28, "Address": "654 Cedar Blvd, Union City, NJ 07087"}, {"Name": "Sue Johnson", "Age": 29, "Address": "912 Oak St, Weehawken, NJ 07086"}, {"Name": "Bill Williams", "Age": 33, "Address": "245 Pine Rd, West New York, NJ 07093"}, {"Name": "Mary Johnson", "Age": 25, "Address": "369 Birch Ave, Guttenberg, NJ 07093"}, {"Name": "Tom Williams", "Age": 44, "Address": "159 Willow St, Hoboken, NJ 07030"}, ] # Create the DataFrame df = pd.DataFrame(data) # Filter the DataFrame to only include rows where the Address contains the mentioned string df = df[df["Address"].isin(["123 Main St, New York, NY 10001"])] print(df)

Output:

Name Age Address 0 John Smith 35 123 Main St, New York, NY 10001

The .isin() method allows you to specify a list of values that the “Address” column should contain in order for the row to be included in the resulting DataFrame. In this case, the list contains only the value “123 Main St, New York, NY 10001”, so only rows where the “Address” column contains that exact value would be included in the resulting DataFrame.

To filter a Pandas DataFrame using a substring in any specific column data, you can use one of several methods, including the .loc[], .query(), .filter(), .isin(), .apply(), and .map() methods. The specific method you choose will depend on your personal preference and the specific requirements of your project.

Last Updated : 05 Feb, 2023 Like Article Save Article Previous Escape From \n Newline Character in Python Next Massively Speed up Processing using Joblib in Python Share your thoughts in the comments Please Login to comment...

【本文地址】

公司简介

联系我们